Text normalization based on statistical machine translation and internet user support
نویسندگان
چکیده
In this paper, we describe and compare systems for text normalization based on statistical machine translation (SMT) methods which are constructed with the support of internet users. Internet users normalize text displayed in a web interface, thereby providing a parallel corpus of normalized and nonnormalized text. With this corpus, SMT models are generated to translate non-normalized into normalized text. To build traditional language-specific text normalization systems, knowledge of linguistics as well as established computer skills to implement text normalization rules are required. Our systems are built without profound computer knowledge due to the simple self-explanatory user interface and the automatic generation of the SMT models. Additionally, no inhouse knowledge of the language to normalize is required due to the multilingual expertise of the internet community. All techniques are applied on French texts, crawled with our Rapid Language Adaptation Toolkit [1] and compared through Levenshtein edit distance [2], BLEU score [3], and perplexity.
منابع مشابه
A new model for persian multi-part words edition based on statistical machine translation
Multi-part words in English language are hyphenated and hyphen is used to separate different parts. Persian language consists of multi-part words as well. Based on Persian morphology, half-space character is needed to separate parts of multi-part words where in many cases people incorrectly use space character instead of half-space character. This common incorrectly use of space leads to some s...
متن کاملText Normalization Using Hybrid Approach
Machine Translation (MT) was an important area of Natural Language Processing that dealt with the translation of one natural language to another language. In this paper we were presenting the research on Translation of short messages to Plain English Text Messages. In today’s world where communication over the internet had increased by using various types of websites and another internet applic...
متن کاملArchitecture for Text Normalization using Statistical Machine Translation techniques
This paper proposes an architecture, based on statistical machine translation, for developing the text normalization module of a text to speech conversion system. The main target is to generate a language independent text normalization module, based on data and flexible enough to deal with all situations presented in this task. The proposed architecture is composed by three main modules: a toke...
متن کاملA REVIEW PAPER ON SMS TEXT TO PLAIN ENGLISH TRANSLATION(Text Normalization)
Mobile technology as well as social networking technology plays an important role in communication across internet. A large amount of information is found in noisy contexts as texting and chat lingo have become increasingly considerably in the past decade. This noisy information needs to be normalized into the standard text so that it can be used by the various other tools such as text-to-speec...
متن کاملNormalizing Medieval German Texts: from rules to deep learning
The application of NLP tools to historical texts is complicated by a high level of spelling variation. Different methods of historical text normalization have been proposed. In this comparative evaluation I test the following three approaches to text canonicalization on historical German texts from 15th–16th centuries: rule-based, statistical machine translation, and neural machine translation....
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2010